13 research outputs found

    Heterogeneity-awareness in multithreaded multicore processors

    Get PDF
    During the last decades, Computer Architecture has experienced a great series of revolutionary changes. The increasing transistor count on a single chip has led to some of the main milestones in the field, from the release of the first Superscalar (1965) to the state-of-the-art Multithreaded Multicore Architectures, like the Intel Core i7 (2009).Moore's Law has continued for almost half of a century and is not expected to stop for at least another decade, and perhaps much longer. Moore observed a trend in the process technology advances. So, the number of transistors that can be placed inexpensively on an integrated circuit has increased exponentially, doubling approximately every two years. Nevertheless, having more available transistors can not be always directly translated into having more performance.The complexity of state-of-the-art software has reached heights unthinkable in prior ages, both in terms of the amount of computation and the complexity involved. If we deeply analyze this complexity in software we would realize that software is comprised of smaller execution processes that, although maintaining certain spatial/temporal locality, imply an inherently heterogeneous behavior. That is, during execution time the hardware executes very different portions of software, with huge differences in terms of behavior and hardware requirements. This heterogeneity in the behaviour of the software is not specific of the latest videogame, but it is inherent to software programming itself, since the very beginning of Algorithmics.In this PhD dissertation we deeply analyze the inherent heterogeneity present in software behavior. We identify the main issues and sources of this heterogeneity, that hamper most of the state-of-the-art processor designs from obtaining their maximum potential. Hence, the heterogeneity in software turns most of the current processors, commonly called general-purpose processors, into overdesigned. That is, they have much more hardware resources than really needed to execute the software running on them. This fact would not represent a main problem if we were not concerned on the additional power consumption involved in software computation.The final goal of this PhD dissertation consists in assigning each portion of software exactly the amount of hardware resources really needed to fully exploit its maximal potential; without consuming more energy than the strictly needed. That is, obtaining complexity-effective executions using the inherent heterogeneity in software behavior as steering indicator. Thus, we start deeply analyzing the heterogenous behaviour of the software run on top of general-purpose processors and then matching it on top of a heterogeneously distributed hardware, which explicitly exploit heterogeneous hardware requirements. Only by being heterogeneity-aware in software, and appropriately matching this software heterogeneity on top of hardware heterogeneity, may we effectively obtain better processor designs.The PhD dissertation is comprised of four main contributions that cover both multithreaded single-core (hdSMT) and multicore (TCA Algorithm, hTCA Framework and MFLUSH) scenarios, deeply explained in their corresponding chapters in the PhD dissertation memory. Overall, these contributions cover a significant range of the Heterogeneity-Aware Processors' design space. Within this design space, we have focused on the state-of-the-art trend in processor design: Multithreaded Multicore (CMP+SMT) Processors.We make special emphasis on the MPsim simulation tool, specifically designed and developed for this PhD dissertation. This tool has already gone beyond this PhD dissertation, becoming a reference tool by an important group of researchers spread over the Computer Architecture Department (DAC) at the Polytechnic University of Catalonia (UPC), the Barcelona Supercomputing Center (BSC) and the University of Las Palmas de Gran Canaria (ULPGC)

    The MPsim simulation tool

    Get PDF
    In order to evaluate novel ideas, computer architects require simulation tools which model a target architecture. According to the specific accuracy requirements we find very specific simulators, which model a single architecture with high accuracy and computational cost, like the ones typically used in the industry, and general purpose simulators with a less accurate model but requiring less computational cost, like the ones typically used in the academia. Focusing on the latter, flexible simulation tools allow evaluating a wide range of system configuration, requiring low effort to evaluate novel ideas. Consequently, the flexibility is a main characteristic to be considered by computer architects when selecting a general-purpose simulation tool. In this paper it is presented a highly-flexible general-purpose simulation tool : the MPsim. It allows simulating a wide range of processor types, both single core (Superscalar, SMT) and multi core (CMP, CMP+SMT), both homogeneous and heterogeneous configurations. It is put special emphasis on the simulator flexibility and how it is obtained. The simulation results included indicate that highflexibility may be obtained without hardly compromising the computational cost in a general purpose simulator.Postprint (published version

    A complexity-effective simultaneous multithreading architecture

    Get PDF
    Different applications may exhibit radically different behaviors and thus have very different requirements in terms of hardware support. In simultaneous multithreading (SMT) architectures, the hardware is shared among multiple running applications in order to better profit from it. However, current architectures are designed for the common case, and try to satisfy a number of different application classes with a single design. That is, current designs are usually overdesigned for most cases, obtaining high performance, but wasting a lot of resources to do so. In this paper we present an alternative SMT architecture, the heterogeneously distributed SMT (hdSMT). Our architecture is based in a novel combination of SMT and clustering techniques in a heterogeneity-aware fashion. The hardware is designed to match the heterogeneous application behavior with the statically and heterogeneously partitioned resources. Such a design is aimed for minimizing the amount of resources wasted to achieve a given performance rate. On top of our statically partitioned architecture, we propose a heuristic policy to map threads to clusters so that each cluster matches the characteristics of the running threads and overall hardware usage is optimized. We compare our hdSMT architecture with a monolithic SMT processor, where all threads compete for the same resources, and with a homogeneous clustered SMT, where resources are statically and equally partitioned across clusters. Our results show that hdSMT architectures obtain an average improvement of 13% and 14% in optimizing performance per area over monolithic SMT and homogeneously clustered SMT respectively.Peer ReviewedPostprint (published version

    Maximizing multithreaded multicore architectures through thread migrations

    Get PDF
    Heterogeneity in general-purpose workloads often end up in non optimal per-thread hardware resource usage. The current trend towards multicore architectures, containing several multithreaded cores, increases the need of a complexity-effective way to expose the heterogeneity in general-purpose workloads to the underlying hardware, in order to obtain all the potential performance of these architectures. In this paper we present the Heterogeneity-Aware Dynamic Thread Migrator (hDTM), a novel complexity-effective hardware mechanism that exposes the heterogeneity in software to the hardware, also enabling the hardware to react to the dynamic behavior variations in the running applications. By means of core-to-core thread migrations, the hDTM mechanism strives to perform the desired behavior transparently to the Operating System. As an example of the general-purpose hDTM concept presented in this paper, we describe a naive hDTM implementation for a Power5-like processor and provide results on the benefits of the proposed mechanism. Our results indicate that even this simple hDTM implementation is able to get close to hDTM’s goal, not only avoiding losses due to bad thread-to-core assignments (up to a 25%) but also going beyond the best static thread-to-core assignment upper limit.Postprint (published version

    The MPsim simulation tool

    No full text
    In order to evaluate novel ideas, computer architects require simulation tools which model a target architecture. According to the specific accuracy requirements we find very specific simulators, which model a single architecture with high accuracy and computational cost, like the ones typically used in the industry, and general purpose simulators with a less accurate model but requiring less computational cost, like the ones typically used in the academia. Focusing on the latter, flexible simulation tools allow evaluating a wide range of system configuration, requiring low effort to evaluate novel ideas. Consequently, the flexibility is a main characteristic to be considered by computer architects when selecting a general-purpose simulation tool. In this paper it is presented a highly-flexible general-purpose simulation tool : the MPsim. It allows simulating a wide range of processor types, both single core (Superscalar, SMT) and multi core (CMP, CMP+SMT), both homogeneous and heterogeneous configurations. It is put special emphasis on the simulator flexibility and how it is obtained. The simulation results included indicate that highflexibility may be obtained without hardly compromising the computational cost in a general purpose simulator

    Some estimation and bias issues in business surveys

    Get PDF
    Available from British Library Document Supply Centre- DSC:DXN064901 / BLDSC - British Library Document Supply CentreSIGLEGBUnited Kingdo

    Thread to core assignment in SMT on-chip multiprocessors

    No full text
    State-of-the-art high-performance processors like the IBM POWER5 and Intel i7 show a trend in industry towards on-chip Multiprocessors (CMP) involving Simultaneous Multithreading (SMT) in each core. In these processors, the way in which applications are assigned to cores plays a key role in the performance of each application and the overall system performance. In this paper we show that the system throughput highly depends on the Thread to Core Assignment (TCA), regardless the SMT Instruction Fetch (IFetch) Policy implemented in the cores. Our results indicate that a good TCA can improve the results of any underlying IFetch Policy, yielding speedups of up to 28%. Given the relevance of TCA, we propose an algorithm to manage it in CMP+SMT processors. The proposed throughput-oriented TCA Algorithm takes into account the workload characteristics and the underlying SMT IFetch Policy. Our results show that the TCA Algorithm obtains thread-to-core assignments 3% close to the optimal assignation for each case, yielding system throughput improvements up to 21%.Peer ReviewedPostprint (published version

    A complexity-effective simultaneous multithreading architecture

    No full text
    Different applications may exhibit radically different behaviors and thus have very different requirements in terms of hardware support. In simultaneous multithreading (SMT) architectures, the hardware is shared among multiple running applications in order to better profit from it. However, current architectures are designed for the common case, and try to satisfy a number of different application classes with a single design. That is, current designs are usually overdesigned for most cases, obtaining high performance, but wasting a lot of resources to do so. In this paper we present an alternative SMT architecture, the heterogeneously distributed SMT (hdSMT). Our architecture is based in a novel combination of SMT and clustering techniques in a heterogeneity-aware fashion. The hardware is designed to match the heterogeneous application behavior with the statically and heterogeneously partitioned resources. Such a design is aimed for minimizing the amount of resources wasted to achieve a given performance rate. On top of our statically partitioned architecture, we propose a heuristic policy to map threads to clusters so that each cluster matches the characteristics of the running threads and overall hardware usage is optimized. We compare our hdSMT architecture with a monolithic SMT processor, where all threads compete for the same resources, and with a homogeneous clustered SMT, where resources are statically and equally partitioned across clusters. Our results show that hdSMT architectures obtain an average improvement of 13% and 14% in optimizing performance per area over monolithic SMT and homogeneously clustered SMT respectively.Peer Reviewe

    MFLUSH: handling long-latency loads in SMT on-chip multiprocessors

    No full text
    Nowadays, there is a clear trend in industry towards employing the growing amount of transistors on chip in replicating execution cores (CMP), where each core is Simultaneous Multithreading (SMT). State-of-the-art high-performance processors like the IBM POWER5 and POWER6 corroborate this CMP+SMT trend. Within each SMT core any of the well-known SMT mechanisms may be applied to face SMT related challenges. Among them, probably the most important issue in an SMT execution pipeline concerns the In-struction Fetch (IFetch) Policy. The FLUSH IFetch Policy represents a choice for throughput-oriented scenarios. It handles L2 cache misses in order to avoid hardware resource monopolization by any given execution Thread; involving an additional energy cost via instruction refetching. However, the new constraints imposed by the CMP+SMT scenario may a ect wellknown SMT mechanisms, like the FLUSH mechanism.Peer Reviewe

    Thread to core assignment in SMT on-chip multiprocessors

    No full text
    State-of-the-art high-performance processors like the IBM POWER5 and Intel i7 show a trend in industry towards on-chip Multiprocessors (CMP) involving Simultaneous Multithreading (SMT) in each core. In these processors, the way in which applications are assigned to cores plays a key role in the performance of each application and the overall system performance. In this paper we show that the system throughput highly depends on the Thread to Core Assignment (TCA), regardless the SMT Instruction Fetch (IFetch) Policy implemented in the cores. Our results indicate that a good TCA can improve the results of any underlying IFetch Policy, yielding speedups of up to 28%. Given the relevance of TCA, we propose an algorithm to manage it in CMP+SMT processors. The proposed throughput-oriented TCA Algorithm takes into account the workload characteristics and the underlying SMT IFetch Policy. Our results show that the TCA Algorithm obtains thread-to-core assignments 3% close to the optimal assignation for each case, yielding system throughput improvements up to 21%.Peer Reviewe
    corecore